ABSTRACT 

Person-fit statistics have been proposed to investigate the 
fit of an item score pattern to an item response theory (IRT) model. This 
study investigated how these statistics can be used to detect different types 
of misfit. Intelligence test data for 992 people at or beyond college level 
were analyzed using person-fit statistics in the context of the Rasch model 
and Mokken's IRT models (R. Mokken, 1997). The sensitivity for different types 
of misfit was illustrated. The effect of the choice of an IRT model to detect 
person misfit and the usefulness of person-fit statistics as a diagnostic 
instrument are discussed. (Contains 1 figure, 7 tables, and 52 references.) 
(Author/SLD) 
Person-fit statistics have been proposed to investigate the fit of an item score pattern 
to an item response theory (IRT) model. I investigated how these statistics can be used 
to detect different types of misfit. Intelligence test data were analyzed using person-fit 
statistics in the context of the Rasch model and Mokken’s IRT models. The sensitivity 
for different types of misfit was illustrated. The effect of the choice of an IRT model to 
detect person misfit and the usefulness of person-fit statistics as a diagnostic instrument 
are discussed. 
From the responses to the items on a psychological test a total score is obtained that 
reflects a person’s position on the trait that is being measured. The test score, however, 
might be inadequate as a measure of a person’s trait level. For example, a person may 
guess some of the correct answers to multiple-choice items on an intelligence test, thus 
raising his/her total score on the test by luck and not by ability. Similarly, a person not 
familiar with the test format on a computerized test may obtain a lower score than expected 
on the basis of his/her ability level. Inaccurate measurement of the trait level may also 
be caused by sleeping behavior (e.g., inaccurately answering the first items in a test as a 
result of, for example, problems of getting started), cheating behavior (e.g., copying the 
correct answers of another examinee), and plodding behavior (e.g., working very slowly 
and methodically and, as a result, generating item score patterns which are too good to be 
true given the stochastic nature of a person’s response behavior under the assumption of 
most test models). Other examples can be found in Wright and Stone (1979, pp. 165-190). 

It is important that a researcher has at his or her disposal methods that can help to 
judge if the item scores of an individual are determined by the construct that is being 
measured. Person-fit statistics have been proposed that can be used to investigate whether 
a person answers the items according to the underlying construct the test measures or that 
other answering mechanisms apply (e.g., Levine & Drasgow, 1985; Meijer & Sijtsma, 
1995, 2001; Smith, 1985, 1986; Wright & Stone, 1979). Most statistics are formulated in 
the context of item response theory (IRT) models (Embretson & Reise, 2000; Hambleton 
& Swaminathan, 1985) and are sensitive to the fit of an individual score pattern to a 
particular IRT model. IRT models are useful in describing the psychometric properties 
of both aptitude and personality measures and are widely used both in psychological and 
educational assessment. 

An overview of the existing person-fit literature (Meijer & Sijtsma, 2001) suggests 
that most person-fit studies have focused on theoretical development of person-fit 
statistics and the power of these statistics to detect misfitting item score patterns under 
varying testing and person characteristics. Most often, simulated data were used that 
enabled the researcher to distinguish between misfitting and fitting score patterns and 
thus to determine the power of a person-fit statistic. Also, in most studies a dichotomous 
decision was made whether the complete item score pattern fit or did not fit the IRT model. 
However, it may also be useful to obtain information about the subsets of items to which 
a person gives unexpected responses, which assumptions of an IRT model have been violated, 
and how serious the violation is (e.g., Reise, 2000; Reise & Flannery, 1996). Answers to these 
questions may allow for a more diagnostic approach leading to a better understanding of a 
person’s response behavior. Few studies illustrate systematically the use of these 
person-fit statistics as a diagnostic tool. 

The aim of this paper is to discuss and apply a number of person-fit statistics proposed 
for parametric IRT and nonparametric IRT models. In particular, I will apply person-fit 
statistics in the context of the Rasch (1960) model and Mokken’s (1971, 1997) IRT models 
that can be used to diagnose different kinds of aberrant response behavior. Within the 
context of both kinds of IRT models, person-fit statistics have been proposed that can 
help the researcher to diagnose the item scores. Existing studies, however, are conducted 
using either a parametric or nonparametric IRT model, and it is unclear how parametric 
and nonparametric person-fit statistics relate to each other. To illustrate the diagnostic use 
of person-fit statistics I will use empirical data from an intelligence test in the context of 
personnel selection. 

This paper is organized as follows. First, I will introduce the basic principles of IRT 
and discuss parametric and nonparametric IRT. Because nonparametric IRT models are 
relatively unknown, I will discuss nonparametric models more extensively. Second, I 
will introduce person-fit statistics and person tests that are sensitive to different types of 
misfitting score patterns for both parametric and nonparametric IRT modeling. Third, I 
will illustrate the use of the statistics to study misfitting item score patterns associated with 
misunderstanding instruction, item disclosure, and random response behavior using the 
empirical data of an intelligence test. Finally, I will give the researcher some suggestions 
as to which statistics can be used to detect specific types of misfit. 

IRT models 



Parametric IRT 

Fundamental to IRT is the idea that psychological constructs are latent, that is, not directly 
observable and that knowledge about these constructs can only be obtained through the 
manifest responses of persons to a set of items. IRT explains the structure in the manifest 
responses by assuming the existence of a latent trait ($\theta$) on which persons and items have 
a position. IRT models allow the researcher to check if the data fit the model. The focus 
in this article is on IRT models for dichotomous items. Thus, one response category is 
positively keyed (item score 1), whereas the other is negatively keyed (item score 0); 
for ability and achievement items these response categories are usually "correct" and 
"incorrect", respectively. 

In IRT, the probability of obtaining a correct answer on item $g$ ($g = 1, \ldots, k$) is a 
function of $\theta$ and characteristics of the item. This conditional probability $P_g(\theta)$ is the item 
response function (IRF). Item characteristics that are often taken into account are the item 
discrimination ($a$), the item location ($b$), and the pseudo-chance level parameter ($c$). The 
item location $b$ is the point at the trait scale where the probability of a correct response is 
0.5 (for $c = 0$). The greater the value of the $b$ parameter, the greater the ability that is required 
for an examinee to have a 50% chance of correctly answering the item; thus the harder the item. 
Difficult items are located to the right or the higher end of the ability scale; easy items 
are located to the left of the ability scale. When the ability levels are transformed so their 
mean is 0 and their standard deviation is 1, the values of $b$ vary typically from about -2 
(very easy) to +2 (very difficult). The $a$ parameter is proportional to the slope of the IRF 
at the point $b$ on the ability scale. In practice, $a$ ranges from 0 (flat IRF) to 2 (very steep 
IRF). Items with steeper slopes are more useful for separating examinees near an ability 
level $\theta$. The pseudo-chance level parameter $c$ (ranging from 0 to 1) is the probability of a 
1 score for low-ability examinees (that is, $\theta \to -\infty$). 

In parametric IRT, $P_g(\theta)$ often is specified using the 1-, 2-, or 3-parameter logistic 
model (1-, 2-, 3PLM). The 3PLM (Lord & Novick, 1968, Chaps. 17-20) is defined as 

$$
P_g(\theta) = c_g + \frac{(1 - c_g) \exp[a_g(\theta - b_g)]}{1 + \exp[a_g(\theta - b_g)]}. \tag{1}
$$



The 2PLM can be obtained by setting $c_g = 0$ for all items; and the 1PLM or Rasch (1960) 
model can be obtained by additionally setting $a_g = 1$ for all items. In the 2- and 3PLM 
the IRFs may cross, whereas in the Rasch model the IRFs do not cross. An advantage of 
the Rasch model is that the item ordering according to the item difficulty is the same for 
each $\theta$ value, which facilitates the interpretation of misfitting score patterns across $\theta$. 
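
As a minimal sketch (not part of the original analysis), the logistic models above can be written as one function, with the 2PLM and the Rasch model obtained by fixing parameters. The function names `irf_3pl` and `irf_rasch` and the parameter values in the example are illustrative assumptions.

```python
import math


def irf_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the 3PLM (Equation 1).

    theta: latent trait value; a: discrimination; b: location;
    c: pseudo-chance level. Setting c = 0 gives the 2PLM.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic


def irf_rasch(theta, b):
    """Rasch model: the 3PLM with a = 1 and c = 0."""
    return irf_3pl(theta, a=1.0, b=b, c=0.0)


# Illustrative values (assumptions, not taken from the data set).
p_at_location = irf_rasch(theta=0.0, b=0.0)              # theta equals b
p_with_guessing = irf_3pl(theta=1.0, a=1.5, b=1.0, c=0.2)
```

With $c = 0$ the probability at $\theta = b$ is 0.5; with a nonzero pseudo-chance level the probability at $\theta = b$ is $(1 + c)/2$.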
Most IRT models assume unidimensionality, and a specified form for the IRF which 
can be checked empirically. Unidimensionality means that the latent space that explains 
the person’s test performance is unidimensional. Related to unidimensionality is the 
assumption of local independence. Local independence states that the responses in a test 
are statistically independent conditional on $\theta$. Thus, local independence is evidence for 
unidimensionality if the IRT model contains person parameters on only one dimension. 

For this study it is important to understand that IRT models are stochastic versions of 
the deterministic Guttman (1950) model. The Guttman model is defined by 

$$
\theta < b_g \Leftrightarrow P_g(\theta) = 0 \tag{2}
$$

and

$$
\theta \geq b_g \Leftrightarrow P_g(\theta) = 1. \tag{3}
$$

The model thus excludes a correct answer on a relatively difficult item and an 
incorrect answer on an easier item. The items answered correctly are always the easiest 
or most popular items on the test. These principles are not restricted to items concerning 
knowledge, but also apply to the domains of intelligence, attitude, and personality 
measurement. 

This view of test behavior leads to a deterministic test model, in which a person should 
never answer an item incorrectly when he/she answers a more difficult item correctly. 
An important consequence is that given the total score, the individual item responses can 
be reproduced. On the person level this implies the following. Assuming throughout 
this paper that the items are ordered from easy to difficult, it is expected on the basis of 
the Guttman model that given a total score $X_+$, the correct responses are given on the 
first $X_+$ items and the incorrect responses are given on the remaining $k - X_+$ items. Such 
a pattern is called a Guttman pattern. A pattern with all correct responses in the last $X_+$ 
positions and incorrect responses in the remaining positions is called a reversed Guttman 
pattern. As Guttman (1950, p. 64) observed, empirically obtained test data are often not 
perfectly reproducible. In IRT models, the probability of answering an item correctly is 
between 0 and 1 and thus errors in the sense that an easy item is answered incorrectly 
and a difficult item is answered correctly are allowed. Many reversals, however, point to 
aberrant response behavior. 
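
The notions of a Guttman pattern, a reversed Guttman pattern, and reversals can be made concrete with a short sketch. It assumes the item scores are already ordered from easy to difficult; the function names are hypothetical, and counting item pairs in which the easier item is answered incorrectly and the more difficult item correctly is only one simple way to quantify reversals.

```python
def is_guttman_pattern(scores):
    """True if all correct responses (1s) come before all incorrect ones (0s)."""
    total = sum(scores)
    return list(scores) == [1] * total + [0] * (len(scores) - total)


def is_reversed_guttman_pattern(scores):
    """True if all correct responses are in the last X_+ positions."""
    total = sum(scores)
    return list(scores) == [0] * (len(scores) - total) + [1] * total


def count_reversals(scores):
    """Count item pairs with the easier item wrong and the harder item right."""
    reversals = 0
    zeros_seen = 0
    for x in scores:
        if x == 0:
            zeros_seen += 1
        else:
            # Every earlier (easier) item scored 0 forms one reversal with this 1.
            reversals += zeros_seen
    return reversals


guttman = [1, 1, 0, 0, 0]
reversed_guttman = [0, 0, 0, 1, 1]
mixed = [1, 0, 1, 0, 0]
```

A Guttman pattern has zero reversals, and for a given total score the reversed Guttman pattern has the maximum number of reversals.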

Parametric Person-Fit Methods 

Model-data fit can be investigated on the item or person level. Examples of model-data 
fit studies on the item level can be found in Thissen and Steinberg (1988), Meijer, Sijtsma, 
and Smid (1990), and Reise and Waller (1993). Because the central topic in this study is 
person fit, I will not discuss these item-fit statistics. The interested reader should refer to 
the references above for more details. 

Although several person-fit statistics have been proposed, I will discuss two types of 
fit statistics that can be used in a complementary way: (a) statistics that are sensitive to 
violations against the Guttman model and (b) statistics that can be used to detect violations 
against unidimensionality. I will use statistics that can be applied without modeling an 
alternative type of misfitting behavior (Levine & Drasgow, 1988). Testing against a 
specified alternative is an option when the researcher knows what kind of misfit to expect. 
This approach has the advantage that the power is often higher than when no alternative 
is specified. The researcher, however, is often unsure about the kind of misfit to expect. 
In this situation, the statistics discussed below are useful. 

There are several person-fit statistics that can be applied using the Rasch model. I 
illustrate two methods that have well-known statistical properties. One of the statistics is 
a uniformly most powerful test (e.g., Lindgren, 1993, p. 350). 

Violations against the Guttman model. Most person-fit methods in the context of the 
2- and 3PLM are based on determining the likelihood of a score pattern. Many studies 
have been conducted using the log-likelihood function $l$. Let $X_g$ denote the item score on 
item $g$, then 

$$
l = \sum_{g=1}^{k} \left\{ X_g \ln P_g(\theta) + (1 - X_g) \ln\left[1 - P_g(\theta)\right] \right\}. \tag{4}
$$

A standardized version of this statistic is denoted $l_z$ (Drasgow, Levine, and Williams, 
1985; Levine and Rubin, 1979). $l_z$ was proposed to obtain a statistic that was less 
confounded with $\theta$ than $l$, that is, a statistic whose value is less dependent on $\theta$. $l_z$ is 
given by 

$$
l_z = \frac{l - E(l)}{\sqrt{\mathrm{Var}(l)}}, \tag{5}
$$



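where $E(l)$ and $\mathrm{Var}(l)$ denote the expectation and the variance of $l$, respectively. These 
quantities are given by 

$$
E(l) = \sum_{g=1}^{k} \left\{ P_g(\theta) \ln\left[P_g(\theta)\right] + \left[1 - P_g(\theta)\right] \ln\left[1 - P_g(\theta)\right] \right\}, \tag{6}
$$

and 

$$
\mathrm{Var}(l) = \sum_{g=1}^{k} P_g(\theta) \left[1 - P_g(\theta)\right] \left\{ \ln \frac{P_g(\theta)}{1 - P_g(\theta)} \right\}^2. \tag{7}
$$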



Large negative $l_z$ values indicate aberrant response behavior. Below I will give an 
example of the use of $l$ in the context of the Rasch model. 
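
The computation of $l$ and $l_z$ can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the empirical analysis: $\theta$ is treated as known, the Rasch model supplies the response probabilities, and the five item difficulties and all function names are hypothetical.

```python
import math


def rasch_prob(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def log_likelihood(scores, probs):
    """Log-likelihood l of an item score pattern (Equation 4)."""
    return sum(x * math.log(p) + (1 - x) * math.log(1.0 - p)
               for x, p in zip(scores, probs))


def lz_statistic(scores, probs):
    """Standardized log-likelihood l_z (Equations 5 to 7)."""
    l = log_likelihood(scores, probs)
    expectation = sum(p * math.log(p) + (1.0 - p) * math.log(1.0 - p)
                      for p in probs)
    variance = sum(p * (1.0 - p) * math.log(p / (1.0 - p)) ** 2
                   for p in probs)
    return (l - expectation) / math.sqrt(variance)


# Hypothetical example: five items, theta assumed known and equal to 0.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
probs = [rasch_prob(0.0, b) for b in difficulties]

lz_guttman = lz_statistic([1, 1, 0, 0, 0], probs)    # easy items correct
lz_reversed = lz_statistic([0, 0, 0, 1, 1], probs)   # difficult items correct
```

In this example the Guttman pattern yields a positive $l_z$ and the reversed Guttman pattern a large negative $l_z$. With only five items the standard normal reference distribution is not appropriate, so the values serve only to show the direction of the statistic.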

To classify an item score pattern as (mis)fitting, a distribution is needed under 
response behavior that fits the IRT model. For long tests (longer than, say, 80 items) and 
true $\theta$ it can be shown that $l_z$ is distributed as standard normal. A researcher can specify 
a type I error rate, say $\alpha = 0.05$, and classify an item score pattern as misfitting when 
$l_z < -1.65$. In practice, however, $\theta$ must be estimated and with short tests this leads 
to thicker tail probabilities than expected under the standard normal distribution which 
results in a conservative classification of item score patterns as misfitting (Nering, 1995, 
1997; Molenaar & Hoijtink, 1990). Therefore, Snijders (2001) derived an asymptotic 
sampling distribution for a family of person-fit statistics like $l$ where a correction factor 
was used for the estimate of $\theta$, denoted $\hat{\theta}$. For an empirical example using this correction 
factor see Meijer and van Krimpen-Stoop (2001). 

The $l$ or $l_z$ statistic is most often used in the context of the 2PLM or 3PLM. Because I 
use the Rasch model to analyze an empirical dataset, a simplified version of $l$ is employed. 
For the Rasch model, $l$ can be simplified as the sum of two terms, 

$$
l = d + M, \tag{8}
$$

with 
$$
d = -\sum_{g=1}^{k} \ln\left[1 + \exp(\theta - b_g)\right] + \theta X_+ \tag{9}
$$

and 

$$
M = -\sum_{g=1}^{k} b_g X_g. \tag{10}
$$

Given $X_+ = \sum_{g=1}^{k} X_g$ (that is, given $\hat{\theta}$, which in the Rasch model depends only on 
the sufficient statistic $X_+$), $d$ is independent of the item score pattern ($X_g$ is absent in 
Equation 9) and $M$ is dependent on it. Given $X_+$, $l$ and $M$ have the same ordering in 
the item score pattern, that is, the item score patterns are ordered similarly by $M$ and $l$. 
Because of its simplicity, Molenaar and Hoijtink (1990) used $M$ rather than $l$ as a 
person-fit statistic. 

To illustrate the use of $M$ consider all possible item score patterns with $X_+ = 2$ on 
a five-item test. In Table 1 all possible item score patterns on the test are given with 
their $M$-values assuming $b = (-2, -1, 0, 1, 2)$. Pattern #1 is most plausible, that is, this 
pattern has the highest probability under the Rasch model. Pattern #10 is least plausible 
under this configuration of item difficulties. Note that the item difficulty parameters as 
explained above are centered around zero. 
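
The $M$-values for this five-item example can be enumerated directly. The sketch below is an illustration, with hypothetical function names; it ranks the ten patterns from largest to smallest $M$ (the ordering of tied patterns is an assumption and may differ from the numbering in Table 1) and checks numerically that $l - M$ is the same for every pattern, as Equation 9 implies. The value of $\theta$ used for that check is arbitrary.

```python
import math
from itertools import combinations

difficulties = [-2, -1, 0, 1, 2]  # b values given in the text


def m_statistic(scores, b):
    """M = -sum_g b_g X_g (Equation 10)."""
    return -sum(bg * x for bg, x in zip(b, scores))


def rasch_log_likelihood(scores, b, theta):
    """Rasch log-likelihood l; algebraically equal to d + M."""
    return sum(x * (theta - bg) - math.log(1.0 + math.exp(theta - bg))
               for x, bg in zip(scores, b))


def patterns_with_total(k, total):
    """All 0/1 patterns of length k with exactly `total` ones."""
    for ones in combinations(range(k), total):
        yield tuple(1 if g in ones else 0 for g in range(k))


# Rank patterns from most to least plausible (largest to smallest M).
table = sorted(((m_statistic(p, difficulties), p)
                for p in patterns_with_total(5, 2)), reverse=True)

# l - M should equal d for every pattern with the same total score.
theta = 0.3  # arbitrary fixed value, used only for the check
d_values = [rasch_log_likelihood(p, difficulties, theta) - m
            for m, p in table]

for rank, (m, p) in enumerate(table, start=1):
    print(rank, "".join(map(str, p)), m)
```

The Guttman pattern 11000 obtains the maximum $M = 3$ and the reversed Guttman pattern 00011 the minimum $M = -3$; the two lowest values are $-2$ and $-3$.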

Molenaar and Hoijtink (1990) proposed three approximations to the distribution of 
M: (1) complete enumeration, (2) a chi-square distribution, where the mean, standard 
deviation, and skewness of M are taken into account, and (3) a distribution obtained via 
Monte Carlo simulation. For all scores complete enumeration was recommended for tests 
with $k < 8$, and up to $k = 20$ in the case of the relatively extreme scores $X_+ = 1, 2, k - 2$, 
$k - 1$. For other cases a chi-square distribution was proposed with the exception of very 
long tests for which Monte Carlo simulation was recommended. For a relatively simple 
introduction to this statistic and a small application see Molenaar and Hoijtink (1995). 

Insert Table 1 about here 

Statistics like $l_z$ and $M$ are often presented as statistics to investigate the general fit of 
an item score pattern. However, note that $l$ and $M$ are sensitive to a specific type of misfit 
to the IRT model; namely, violations to Guttman patterns. As I discussed above, when the 
items are ordered from easy to difficult, an item score pattern with correct responses in 
the first $X_+$ positions and incorrect responses in the remaining $k - X_+$ positions is called 
a Guttman pattern because it meets the requirements of the Guttman (1950) model. To 
illustrate this, consider again the item score patterns in Table 1. The pattern of person #1 
is a perfect Guttman pattern which results in the maximum value of $M$ and the pattern of 
person #10 is the reversed Guttman pattern which results in a minimum value of $M$. 

As an alternative to M, statistics discussed by Wright and Stone (1979) and Smith 
(1986) can be used. For example, the statistic 



I prefer the use of $M$ because research has shown (e.g., Molenaar & Hoijtink, 
1990) that the actual type I error rates are too sensitive for this statistic to the choice 
of the $\theta$ distribution, item parameter values, and $\theta$ level to be trusted, and the advocated 
standardizations are, in most cases, unable to repair these deficiencies. For example, 
Li and Olejnik (1997) concluded that for both $l_z$ and a standardized version of $B$ the 
sampling distribution under the Rasch model deviated significantly from the standard 
normal distribution. Also, Molenaar and Hoijtink (1995) found for a standardized version 
of $B$ and a standard normal distribution for $\theta$ using 10,000 examinees that the mean of $B$ 
was -0.13, the standard deviation was 0.91, and the 95th percentile was 1.33 rather than 
1.64, and that thus too few respondents would be flagged as possibly aberrant. 

Violations against unidimensionality. To investigate unidimensionality I will check 
whether the ability parameters are invariant over subtests of the total test. In many 
cases, response tendencies that lead to deviant item score patterns cause violations 
of unidimensional measurement, that is, violations against the invariance hypothesis 
of $\theta$ for a suitably chosen subdivision of the test. Trabin and Weiss (1983) discuss, 
in detail, how factors like carelessness, guessing, or cheating may cause specific 
violations against unidimensional measurement when the test is subdivided into subtests 
of different difficulty level. It is also useful to investigate other ways of subdividing 
the test. For example, the order of presentation, the item format, and the item content 
provide potentially interesting ways of subdivision. Each type of split can be used to 
extract information concerning different response tendencies underlying deviant response 
vectors. 

Using the Rasch model, Klauer (1991) proposed a uniformly most powerful test that is 
sensitive to the violation of unidimensional measurement. The statistical test is based on 
subdivisions of the test into two subtests, $A_1$ and $A_2$. Let the scores on these two subtests 
be denoted by $X_1$ and $X_2$ and let the latent traits that underlie the person’s responses on 
these two subtests be denoted by $\theta_1$ and $\theta_2$, respectively. Under the null hypothesis ($H_0$), 
it is assumed that the probabilities underlying the person’s responses are in accordance 
with the Rasch model for each subtest. Under the alternative hypothesis ($H_1$), the two 
subtests need not measure the same latent trait, and a different ability level $\theta_k$ may underlie 
the person’s performance with respect to each subtest $A_k$. An individual’s deviation from 
invariance is given by the parameter $\eta = \theta_1 - \theta_2$, and $H_0: \eta = 0$ is tested against 
$H_1: \eta \neq 0$. To test this hypothesis Klauer (1991) considered the joint distribution of $X_1$ 
and a person’s total score, $X_+$. This joint distribution is determined on the basis of the 
Rasch model. Given a prespecified nominal type I error, cut-off scores are determined for 
subtest scores $X_1$. 

Klauer (1991) gives an example for a test consisting of 15 items with difficulty 
parameters between -0.69 and 0.14, where the test is divided into two subtests of seven 
easy and eight difficult items. Let the cut-off scores be denoted by $c_L(X_+)$ and $c_U(X_+)$. 
For a test score of, for example, $X_+ = 8$ the cut-off scores for $X_1$ are $c_L(X_+) = 3$ and 
$c_U(X_+) = 6$. Thus, score patterns with $X_1 = 3$, 4, 5, or 6 are considered to be consistent 
with the Rasch model. A value of $X_1$ outside this range points at deviant behavior. Thus 
if an examinee with $X_+ = 8$ has only 2 correct answers on the first subtest ($X_1 = 2$) 
and 6 correct answers on the second subtest ($X_2 = 6$), the item score pattern will be 
classified as aberrant. Also, note that an item score pattern with $X_1 = 7$ and $X_2 = 1$ will 
be classified as aberrant. For this score pattern there are too many correct answers on the 
first subtest and too few correct answers on the second subtest given the stochastic nature 
of the Rasch model. 

The test statistic equals 

$$
X^2 = -2 \log\left[p(X_k \mid X_+)\right], \tag{12}
$$

where $X_k$ is the subtest score and $p(X_k \mid X_+)$ is the conditional probability discussed above 
of observing deviations this large or even larger from the invariance hypothesis. Under 
the Rasch model $X^2$ is chi-squared distributed with two degrees of freedom. 
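
The conditional distribution of a subtest score given the total score under the Rasch model can be computed from elementary symmetric functions of $\exp(-b_g)$, which gives a rough numerical feel for Equation 12. The sketch below is an illustration under explicit assumptions and is not Klauer's (1991) procedure: the tail probability simply sums the probabilities of all subtest scores that are no more likely than the observed one, the item difficulties are hypothetical values placed between -0.69 and 0.14, and all function names are invented for the example.

```python
import math


def elementary_symmetric(eps):
    """Elementary symmetric functions gamma_0, ..., gamma_n of eps."""
    gamma = [1.0] + [0.0] * len(eps)
    for e in eps:
        # Update from high to low order so each value enters only once.
        for r in range(len(eps), 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma


def subtest_score_distribution(b_first, b_second, total):
    """P(X_1 = s | X_+ = total) under the Rasch model, for all feasible s."""
    g1 = elementary_symmetric([math.exp(-b) for b in b_first])
    g2 = elementary_symmetric([math.exp(-b) for b in b_second])
    weights = {}
    for s in range(len(b_first) + 1):
        rest = total - s
        if 0 <= rest <= len(b_second):
            weights[s] = g1[s] * g2[rest]
    norm = sum(weights.values())
    return {s: w / norm for s, w in weights.items()}


def tail_probability(dist, observed):
    """Assumed rule: sum of probabilities no larger than the observed one."""
    p_obs = dist[observed]
    return sum(p for p in dist.values() if p <= p_obs * (1.0 + 1e-12))


def x2_statistic(p):
    """X^2 = -2 log(p), as in Equation 12."""
    return -2.0 * math.log(p)


# Hypothetical difficulties for seven easy and eight difficult items.
easy = [-0.69, -0.60, -0.50, -0.40, -0.30, -0.20, -0.10]
hard = [-0.05, 0.00, 0.02, 0.05, 0.08, 0.10, 0.12, 0.14]

dist = subtest_score_distribution(easy, hard, total=8)
p_low = tail_probability(dist, observed=2)   # only 2 correct on the easy subtest
x2_low = x2_statistic(p_low)
```

When all item difficulties are equal, the conditional distribution reduces to the hypergeometric distribution, which offers a simple check of the computation.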

To illustrate the use of this statistic I consider an empirical example given by Klauer 
(1991). A verbal analogies test consisting of 47 items was divided into two subtests of 
23 easy and 24 difficult items. Then a person with a total score of 10 but only 3 correct 
answers on the easy subtest obtained $X^2 = 17.134$ with $p(X_k \mid X_+) = 0.0003$. For this 
person, the invariance hypothesis was violated. Power curves for statistics analogous to 
$X^2$ can be found in Klauer (1995). 

Note that the statistics $X^2$ and $M$ differ in the types of violations of the Rasch model 
to which they are sensitive. The observed item score pattern is considered inconsistent 
with the Rasch model if $M \leq c(X_+)$, where $c(X_+)$ is a cut-off score that depends on the 
test score $X_+$ associated with the response vector and on the chosen $\alpha$-level. For example, 
assume that in Table 1 $c(X_+) = -2$, then item patterns #9 and #10 are classified as 
inconsistent with the Rasch model. Assume now as an alternative hypothesis, $H_1$, that a few 
examinees exhibit the misfitting test behavior described above as "plodding" behavior, which 
probably results in an almost perfect Guttman pattern like pattern #1. In this case maximal 
values of the statistic $M$ are obtained, indicating perfect fit. In contrast, the statistic $X^2$ 
will flag such a pattern as misfitting when I split the test into a subtest with the first $k/2$ 
items and a subtest with the second $k/2$ items (assuming an even number of items). This 
pattern is inconsistent with the Rasch model because $X_1$ on the easiest subtest is too high 
and $X_2$ on the second subtest is too low given the assumptions of the (stochastic) Rasch model. 

Importantly, in this study I will calculate $X^2$ in two ways: on the basis of the item 
difficulty ordering in the test, denoted $X^2_{dif}$, and on the basis of the presentation 
order of the items in the test, denoted $X^2_{ord}$. 

Nonparametric IRT 

Although parametric models are used in many IRT applications, nonparametric IRT 
models are becoming more popular (e.g., Stout, 1990). For a review of nonparametric 
IRT, see Sijtsma (1998) or Junker and Sijtsma (2001). In this study I will analyze the 
data by means of the Mokken (1971) models. I use these models because they are 
popular nonparametric IRT models (e.g., Mokken, 1997; Sijtsma, 1998). There is also 
a user-friendly computer program, MSP5 for Windows, to operationalize these models 
(Molenaar & Sijtsma, 2000). 

The first model proposed by Mokken (1971; 1997; see also Molenaar, 1997) is 
the model of monotone homogeneity (MHM). This model assumes unidimensional 
measurement and an increasing IRF as function of $\theta$. However, unlike parametric IRT, 
the IRF is not parametrically defined and the IRFs may cross. The MHM allows the 
ordering of persons with respect to $\theta$ using the unweighted sum of item scores. In many 
testing applications, it often suffices to know the order of persons on an attribute (e.g., in 
selection problems). Therefore, the MHM is an attractive model for two reasons. First, 
ordinal measurement of persons is guaranteed when the model applies to the data. Second, 
the model is not as restrictive with respect to empirical data as the Rasch model and thus 
can be used in situations where the Rasch model does not fit the data. 

The second model proposed by Mokken (1971) is the model of double monotonicity 
(DMM). The DMM is based on the same assumptions as the MHM plus the additional 
assumption of nonintersecting IRFs. Under the DMM it is assumed that when k items are 
ordered and numbered, the conditional probabilities of obtaining a positive response are 
given as, 



$$
P_1(\theta) \geq P_2(\theta) \geq \cdots \geq P_k(\theta), \quad \text{for all } \theta. \tag{13}
$$

Thus the DMM specifies that, except for possible ties, the ordering is identical for all $\theta$ 
values. Note that ordering of both persons and items is possible when the DMM applies; 
however, attributes and difficulties are measured on separate scales. Attributes are 
measured on the true score scale, and difficulties are measured on the scale of proportions. 
Thus, persons can be ordered according to their true scores using the total score. For the 
Rasch model, measurement of items and persons takes place on a common difference or 
a common ratio scale. 
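
The nonintersection requirement of the DMM can be explored numerically for parametric IRFs. The following sketch assumes 2PLM-type IRFs and a finite grid of $\theta$ values, so it can suggest but not prove nonintersection; the function names and item parameters are hypothetical.

```python
import math


def irf_2pl(theta, a, b):
    """2PLM item response function; a = 1 gives the Rasch model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def irfs_nonintersecting(items, grid, tol=1e-12):
    """Check on a grid that no pair of IRFs changes order across theta.

    items: list of (a, b) tuples; grid: iterable of theta values.
    """
    grid = list(grid)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            diffs = [irf_2pl(t, *items[i]) - irf_2pl(t, *items[j])
                     for t in grid]
            # A sign change in the difference means the two IRFs cross.
            if any(d > tol for d in diffs) and any(d < -tol for d in diffs):
                return False
    return True


theta_grid = [x / 10.0 for x in range(-40, 41)]

rasch_items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]   # equal slopes
crossing_items = [(0.5, 0.0), (2.0, 0.5)]             # unequal slopes

rasch_ok = irfs_nonintersecting(rasch_items, theta_grid)
crossing_ok = irfs_nonintersecting(crossing_items, theta_grid)
```

Items with equal slopes keep the same ordering at every grid point, whereas the two items with unequal slopes change order within the grid and therefore violate Equation 13.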
An important difference between the Mokken and the Rasch models is that the IRFs 
for the Mokken models need not be of the logistic form. This difference makes Mokken 
models less restrictive to empirical data than the Rasch model. Thus the Mokken models 
can be used to describe data that do not fit the Rasch model. 

In Figure 1 examples are given of IRFs that can be described by the MHM, DMM, 
and the Rasch model. 



Insert Figure 1 about here 

This difference also suggests that an item score pattern will be less easily classified as 
misfitting under the MHM than under the Rasch model because, under a less restrictive 
model, item score patterns are allowed that are not allowed under a more restrictive model. 
To use an extreme example, assume the data are described by the deterministic Guttman 
model. In the case of the Guttman model every item score pattern with an incorrect 
score on an easier item and a correct score on a more difficult item will be classified 
as misfitting. 

Nonparametric person-fit statistics 

Person-fit research using nonparametric IRT has been less popular than with 
parametric IRT modeling. Although several person-fit statistics have been proposed 
which may be used in nonparametric IRT modeling, Meijer and Sijtsma (2001) conclude 
that one of the drawbacks of these statistics is that the distribution of the values of these 
statistics is dependent on the total score and that the distributional characteristics of these 
statistics conditional on the total score are unknown. 

Van der Flier (1982) proposed a nonparametric person-fit statistic $U3$ that was 
purported to be normally distributed (see, e.g., Meijer, 1998). Recently, Emons, Meijer, 
and Sijtsma (in press) investigated the characteristics of this statistic and concluded that 
in most cases studied there was a large discrepancy between the type I error rate under the 
theoretical sampling distribution and the empirical sampling distribution. This finding 
makes it difficult to interpret person-fit scores based on this statistic. 

Another nonparametric person-fit method was proposed by Sijtsma and Meijer 
(2001). They discussed a statistical method based on the person response function (PRF; 
Trabin & Weiss, 1983; Reise, 2000). The PRF describes the probability that a respondent 
with fixed $\theta$ correctly answers items from a pool of items with varying location. For 
the Rasch model the PRF can be related to the IRF by changing the roles of $\theta$ and $b$, 
where $b$ is treated as a random item variable and $\theta$ is treated as fixed. The PRF is a 
nonincreasing function of item difficulty, that is, the more difficult the item the smaller 
the probability that a person answers an item correctly. Under the DMM, local deviations 
from nonincreasingness can be used to identify misfit. 

To detect local deviations an item score pattern is divided into $G$ subtests of items, 
so that $A_1$ contains the $m$ easiest items, $A_2$ contains the next $m$ easiest items, and so 
on. Let subtests $A_g$ of $m$ increasingly more difficult items be collected in mutually 
exclusive vectors $A_g$ such that $A = (A_1, A_2, \ldots, A_G)$. Consider newly defined vectors 
$A_{(g)}$, each of which contains two adjacent subsets $A_g$ and $A_{g+1}$: $A_{(1)} = (A_1, A_2)$, 
$A_{(2)} = (A_2, A_3)$, ..., $A_{(G-1)} = (A_{G-1}, A_G)$. The statistical method applies to each pair 
in $A_{(g)}$. A useful question in person-fit research is whether the number of correct answers 
on the more difficult items (denoted $X_{+d}$) is exceptionally low given the total score and the 
subtest score on the easier subtest (denoted $X_{+e}$). To test this, Sijtsma and Meijer (2001; 
see also Rosenbaum, 1987) showed that for each pair of subtest scores a conservative 
bound based on the hypergeometric distribution can be calculated (denoted V) for the 
probability that a person has, at most, $X_{+e}$ 1s on the easiest subtest. If, for a particular 
pair, this probability is lower than, say, 0.05, then the conclusion is that $X_{+e}$ in this pair is 
unlikely given $X_{+d}$ and $X_+$. A conservative bound means that the probability under 
the hypergeometric distribution is always equal to or larger than under a nonparametric 
IRT model. Thus, if V = 0.06 is found, then under the IRT model this probability is 
no larger than 0.06. Although this method is based on item response functions that are 
nonintersecting, Sijtsma and Meijer (2001) showed that the results are robust against 
violations of this assumption. Furthermore, they investigated the power of V to detect 
careless response behavior. Detection rates ranged from .018 through .798 depending on 
the $\theta$ level. 

To illustrate the use of this method assume that the items are ordered from easy 
to difficult and consider a person who generates the following item score pattern 
(11110|00000). This person has four correct answers on the five easiest items, which is as 
expected under the DM model, and the cumulative probability under the hypergeometric 
distribution equals 1.00. However, if I consider the pattern (00000|10111), then this 
cumulative probability equals .0238, and it is concluded that at a 5% level this item score 
pattern is unlikely. Note that this method takes into account the subtest score, and not the 
different permutations of zeros and ones within a subtest. In the Appendix the calculations 
are given. 
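
The calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the S-plus program used in this study; the function names `hypergeometric_bound` and `adjacent_pair_bounds` are hypothetical, and the items are assumed to be ordered from easy to difficult.

```python
from math import comb


def hypergeometric_bound(x_easy, x_total, n_easy, n_difficult):
    """Cumulative hypergeometric probability P(X_+e <= x_easy | J, J_e, X_+).

    x_easy      : number-correct score on the easier subtest (x_+e)
    x_total     : number-correct score on both subtests together (X_+)
    n_easy      : number of items in the easier subtest (J_e)
    n_difficult : number of items in the more difficult subtest (J_d)
    """
    n_items = n_easy + n_difficult
    # If X_+ exceeds J_d, at least X_+ - J_d correct answers must be on the easy subtest.
    lowest = max(0, x_total - n_difficult)
    numerator = sum(
        comb(n_easy, w) * comb(n_difficult, x_total - w)
        for w in range(lowest, x_easy + 1)
    )
    return numerator / comb(n_items, x_total)


def adjacent_pair_bounds(pattern, subtest_size):
    """Return the bound for every pair of adjacent subtests (easy -> difficult)."""
    subtests = [pattern[i:i + subtest_size]
                for i in range(0, len(pattern), subtest_size)]
    bounds = []
    for easy, difficult in zip(subtests, subtests[1:]):
        x_easy, x_difficult = sum(easy), sum(difficult)
        bounds.append(hypergeometric_bound(
            x_easy, x_easy + x_difficult, len(easy), len(difficult)))
    return bounds


# The two illustrative patterns from the text, split into two subtests of 5 items.
print(round(adjacent_pair_bounds([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], 5)[0], 4))  # 1.0
print(round(adjacent_pair_bounds([0, 0, 0, 0, 0, 1, 0, 1, 1, 1], 5)[0], 4))  # 0.0238
```

Because only the subtest scores enter the calculation, any permutation of the zeros and ones within a subtest gives the same value.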

Sijtsma and Meijer (2001) did not relate V to specific types of misfit. However, 
if I compare this method with M and $X^2$, it can be concluded that V is sensitive to 
violations of unidimensionality when the test is split into subtests according to the item 
difficulty and for subtests with large values of $X_{+d}$ in combination with small values of 
$X_{+e}$. However, it is insensitive to violations of unidimensionality when $X_{+e}$ is larger than 
$X_{+d}$. Also, random response behavior will not be detected easily because the expected 
$X_+$ is approximately the same on each subtest, which will not result in small values of 
V. Thus, the method proposed by Sijtsma and Meijer (2001) can be considered a special 
case within a nonparametric IRT framework of the method proposed by Klauer (1991), 
as previously discussed. 

Method 

To illustrate the usefulness of the statistics to detect different types of misfit, I used 
empirical data of the Test for Nonverbal Abstract Reasoning (TNVA; Drenth, 1969). The 
test consists of 40 items and is speeded. A sample of 992 persons was available. 
The test is developed for persons at or beyond college/university level. For the present 
analysis the sample consisted of high school graduates. The dataset was collected for 
personnel selection in computer occupations, such as systems analyst and programmer. I 
first investigated the fit of this dataset using the computer programs RSP (Glas & Ellis, 
1993) and MSP5 for Windows (Molenaar & Sijtsma, 2000). The program RSP contains 
fit statistics for monotonicity and local independence that are improved versions of the 
statistic proposed by Wright and Panchapakesan (1969; see also Smith, 1994). It also 
contains statistics for testing the null hypothesis that the IRFs are logistic with equal slopes 
against the alternative that they are not. I concluded that the items reasonably fit the Rasch 
model and the Mokken models. Details about this fit procedure can be obtained from the 
author. Before I analyze the item score patterns, I will first discuss some possible types 
of person misfit that can occur on the TNVA. 

Types of Possible Misfit in Intelligence Testing 

Misunderstanding of instruction/using a suboptimal strategy. 

To maximize performance on the TNVA, examinees must work quickly because there is 
a limited amount of time to complete the test. However, some examinees may focus 
on accuracy, afraid of answering an item incorrectly, and, as a result, may generate a 
pattern where the answers on the first items are (almost) all correct and on the later items 
are (almost) all incorrect ("plodding" behavior as discussed in Wright & Stone, 1979). 
This strategy is especially problematic for the TNVA because the first items are relatively 
difficult (for example, item #5 has a proportion-correct score equal to 0.55). A person 
working very precisely will spend too much time answering the first items, and the total 
score on the test may then underestimate his/her true score. 

As a result of this type of misfit it is expected that, given $X_+$, the test score on the first 
part of the test is too large and on the second part too low given the stochastic nature of 
an IRT model. $X^2_{ord}$ can be used to detect this type of misfit, whereas M is less suitable. 
$X^2_{dif}$ will only be sensitive to this type of misfit when the item presentation ordering 
matches the item difficulty ordering in the test. V is not useful here. Remember that V 
assumes that the items are ordered from easy to difficult and that the method is sensitive to 
unexpectedly low subtest scores on an easy subtest in combination with high subtest scores 
on a more difficult subtest. Working very precisely, however, will result in the opposite 
pattern, namely, high subtest scores on the easy subtest and low subtest scores on the more 
difficult subtest. 

Item Disclosure 

When a test is used in a selection situation with important consequences for individuals, 
persons may be tempted to obtain information about the type of test questions or even 
about the correct answers to particular items. In computerized adaptive testing this is 
one of the major threats to the validity of the test scores, especially when items are used 
repeatedly. This is also a realistic problem for paper-and-pencil tests. The TNVA is often 
used in the Netherlands by different companies and test agencies because there are no 
alternatives. Item exposure is therefore realistic and may result in a larger percentage of 
correct answers than expected on the basis of the trait that is being measured. 

Detection of item exposure by means of a person-fit statistic is difficult because for 
a particular person it is often unknown which items are known and how many items are 
known. Advance knowledge of a few items will only have a small effect on the test score 
and is therefore not so easily detected. Also, advance knowledge of the easiest items will 
only have a small effect on the test score. This outcome suggests that advance knowledge 
of the items of median difficulty and of the most difficult items may have a substantial 
effect on the total score, especially for persons with a low $\theta$ level. This type of misfit may 
be detected by fit statistics like M, $X^2_{dif}$, and V. Note that $X^2_{ord}$ will only be sensitive 
to item disclosure when the item difficulty order matches the order of presentation in the 
test. 

Note that item disclosure has some similarities with response distortion on personality 
scales (Reise & Waller, 1993). Although the mechanisms behind these two types of 
misfit are different, the realized item score patterns may be similar: both are the 
result of faking and both have the effect of increasing the test score. Recently, Zickar 
and Robie (1999) investigated the effect of response distortion on personality scales 
for applicants in selection situations. Response distortion occurs when people present 
themselves in a favorable light so that they make a more favorable impression, for 
example, in selection situations where personality tests are used to predict job success, job 
effectiveness, or management potential. Zickar and Robie (1999) examined the effects of 
experimentally-induced faking on the item- and the scale-level measurement properties 
of three personality constructs that possess potential utility for use in personnel selection 
(Work orientation, Emotional Stability, and Nondelinquency). Using a sample of military 
recruits, comparisons were made between an honest condition and both an ad-lib (i.e., 
"fake good" with no specific instructions on how to fake) and a coached faking (i.e., "fake 
good" with instructions on how to "beat" the test) condition. Faking had a substantial 
effect on the personality latent trait scores and resulted in a moderate degree of differential 
item functioning and differential test functioning. These results suggest that identifying 
faking is important. 







Random response behavior 

Random response behavior may be the result of different underlying mechanisms 
depending on the type of test and the situation in which the test is administered. One 
reason for random response behavior is lack of motivation. Another reason is lack of 
concentration which may result from feeling insecure. In intelligence testing random 
response behavior may occur as a result of answering questions too quickly. $X^2_{dif}$, $X^2_{ord}$, 
and M may be sensitive to random response behavior, depending on which items are 
answered randomly. V seems to be less suited to detect random response behavior because 
V is sensitive to unexpected low subtest scores on an easy subtest in combination with 
high subtest scores on a more difficult subtest. Random response behavior as a result of 
answering questions too quickly will probably result in similar number-correct scores for 
different subtests. 

In Table 2 the sensitivity of the statistics to different types of deviant response 
behavior is depicted, where a "+" sign indicates that the statistic is sensitive to the 
response behavior and a "-" sign indicates that the statistic is expected to be insensitive 
to the response behavior. The reader, however, should be careful when interpreting the 
"+" and "-" signs. For a particular type of deviant behavior, the sensitivity of a statistic 
depends on various factors, for example, whether the presentation ordering of the 
items is in agreement with the difficulty order. If this is the case, both $X^2_{dif}$ and $X^2_{ord}$ are 
sensitive to "Misunderstanding of Instruction". 



To illustrate the validity of the different fit statistics I conducted a small simulation 
study. I simulated 10,000 item score patterns on a 40-item test using the Rasch 
model with item difficulties drawn from a uniform distribution on [-2, 2] and $\theta$ drawn 
from a standard normal distribution. The three types of aberrant response behavior were 
simulated as follows. The use of a suboptimal strategy was simulated by changing the 
item scores on the first 10 items: the probability of responding correctly to these items 
was .90, whereas the probability equalled .10 for the remaining items. Item disclosure 
was simulated by changing the item scores on the 5 most difficult items so that the probability 
of answering an item correctly was .90. Random response behavior was simulated by 
changing the answers on the 10 most difficult items so that the probability of answering an 
item correctly equalled .20. The proportion of correctly identified misfitting score patterns 
for the different statistics is given in Table 3. As expected, $X^2_{ord}$ had 
the highest power for detecting a suboptimal strategy, whereas V was insensitive to this 
type of response behavior. $X^2_{ord}$ was insensitive to item disclosure and V was insensitive 
to random response behavior. 
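
The data-generating step of such a simulation might look as follows in Python. This is a rough sketch under stated assumptions, not the code used for the study: all function names are hypothetical, the "first 10 items" are taken to be the first 10 in presentation order, and the suboptimal strategy is read as overwriting every item score (.90 on the first 10 items, .10 on the rest).

```python
import math
import random


def rasch_probability(theta, difficulty):
    """Rasch model: probability of a correct answer given theta and item difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def simulate_pattern(theta, difficulties, rng):
    """Draw one model-consistent item score pattern."""
    return [1 if rng.random() < rasch_probability(theta, b) else 0
            for b in difficulties]


def overwrite(pattern, item_indices, p_correct, rng):
    """Replace the scores on the given items by draws with a fixed success probability."""
    changed = list(pattern)
    for i in item_indices:
        changed[i] = 1 if rng.random() < p_correct else 0
    return changed


def simulate_aberrant(kind, theta, difficulties, rng):
    """Generate a Rasch pattern and impose one of the three types of aberrance."""
    n_items = len(difficulties)
    pattern = simulate_pattern(theta, difficulties, rng)
    # Item indices sorted from most difficult to easiest.
    hardest_first = sorted(range(n_items), key=lambda i: difficulties[i], reverse=True)
    if kind == "suboptimal":      # assumed reading: .90 on first 10 items, .10 elsewhere
        pattern = overwrite(pattern, range(10), 0.90, rng)
        pattern = overwrite(pattern, range(10, n_items), 0.10, rng)
    elif kind == "disclosure":    # 5 most difficult items answered correctly with p = .90
        pattern = overwrite(pattern, hardest_first[:5], 0.90, rng)
    elif kind == "random":        # 10 most difficult items answered correctly with p = .20
        pattern = overwrite(pattern, hardest_first[:10], 0.20, rng)
    else:
        raise ValueError(f"unknown type of aberrance: {kind}")
    return pattern


rng = random.Random(2001)  # arbitrary seed, for reproducibility only
difficulties = [rng.uniform(-2.0, 2.0) for _ in range(40)]
patterns = [simulate_aberrant("disclosure", rng.gauss(0.0, 1.0), difficulties, rng)
            for _ in range(10000)]
```

Each simulated pattern would then be passed to the person-fit statistics, and the proportion flagged at a given significance level gives the detection rate.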

Procedure 

Person-fit statistics 

The program RSP (Glas & Ellis, 1993) was used to compute the p-values (significance 
probabilities) belonging to the M-statistic (Equation 10). A program developed by Klauer 
(1991) was used to calculate the $X^2$ person-fit statistics, and a program developed in S-PLUS 
(2000) was used to determine V. This program can be obtained from the author. 

In the person-fit literature there has not been much debate about the nominal Type I 
error rate. In general, I prefer to choose a relatively large $\alpha$, for example $\alpha = .05$ or 
$\alpha = .10$, for two reasons. The first reason is that existing person-fit tests have relatively 
low power due to limited test length and a relatively small number of observations. 
Choosing a small value for $\alpha$ (e.g., $\alpha = .001$) will then result in a very low number 
of misfitting score patterns. The second reason is that, in practice, a total score will never be 
withheld from a person solely on the basis of the fit of his/her score pattern to an IRT model 
(i.e., having an extreme person-fit score). A person-fit statistic will alert the researcher 
that a person's response behavior is unexpected and that it may be interesting to study the 
item score pattern more closely. This implies that, in practice, incorrectly rejecting the 
null hypothesis of fitting response behavior has no serious consequences. In this study I 
will use $\alpha = 0.10$, $\alpha = 0.05$, and $\alpha = 0.01$. 
Results 



Descriptive statistics 

The mean number-correct score on the TNVA was 27.08 with a standard deviation of 5.73. 
There were no persons who answered all items incorrectly and there was one person who 
answered all 40 items correctly. The item proportion-correct scores varied from 0.11 
(item 40) to 0.99 (item 4). 



Person-Fit results 

Items ordered according to the item difficulty. 

Table 4 shows the proportion of score patterns classified as misfitting (detection rates) for 
M, $X^2_{dif}$, $X^2_{ord}$, and V. The results using $X^2_{ord}$ will be discussed in the next paragraph. 
Note that M is based on the complete item score 
pattern, whereas the $X^2$ fit statistics and V are based on a split of the item score patterns 
into subtests. By means of $X^2$ and V, the number-correct scores on two subtests are 
compared. I first split the test into a subtest containing the 20 easiest items and a 
subtest containing the 20 most difficult items. Table 4 shows that for M the proportion 
of misfitting item score patterns was somewhat higher than the nominal $\alpha$-levels in all 
conditions. For the $X^2_{dif}$ statistic the proportions at $\alpha = 0.05$ and $\alpha = 0.10$ were higher 
than for the M-statistic. 

Insert Table 4 about here 

To calculate V, I first ordered the items according to decreasing item proportion-correct 
score and split the test into two subtests of 20 items each. Table 4 shows that no item score 
pattern was classified as misfitting. Therefore, I split the test into 4 subtests of 10 items 
each, where subtest #1 contained the easiest items, subtest #2 contained the next 10 easiest 
items, and so on. This was done because Sijtsma and Meijer (2001) showed empirically 
that the power of V often increases when smaller subtests are used. Calculating V using 
subtest #1 and subtest #2, however, yielded no significant results. Thus, the 
proportion of item score patterns classified as misfitting was much higher using $X^2_{dif}$ 
than using V. For example, for $\alpha = 0.05$ using $X^2_{dif}$, I found a proportion of 0.093 
and for $\alpha = 0.10$ I found a proportion of 0.149. V is only sensitive to score patterns 
with unexpected 0 scores on the easier subtest in combination with unexpected 1 scores 
on the more difficult subtest. Therefore, I also determined the detection rates using 
$X^2_{dif}$ when the total score on the second subtest was unexpectedly high compared to 
the first subtest. For the split into two subtests of 20 items and $\alpha = 0.01$, I found a 
proportion of 0.012; for $\alpha = 0.05$ a proportion of 0.043 and for $\alpha = 0.10$ a proportion 
of 0.051. Although the proportions are lower for this particular type of misfit, $X^2_{dif}$ 
still classifies a higher proportion of item score patterns as misfitting than V. 

In Table 5 the four most deviant score patterns according to the $X^2_{dif}$ test are given 
with their corresponding p-values for the $X^2_{ord}$ test and the M-statistic. Person #754 
with $X_+ = 12$ produces a very unexpected item 
score pattern because only 4 out of the 20 easiest items are answered correctly, whereas 
on the 20 most difficult items 8 items are answered correctly. This pattern was also 
classified as unexpected using M. To further illustrate this unexpected behavior, item 
#1 with a proportion-correct score of 0.99 was answered incorrectly, whereas item #39 
with a proportion-correct score of 0.19 was answered correctly. Consider the types of 
aberrant response behavior discussed above to interpret the type of misfit. It is clear that 
"plodding" and "item disclosure" are very unlikely and that this response pattern may be 
the result of "random response behavior". Note that the expected $X_+$ for a person who is 
completely guessing on this test with 40 items with four alternatives per item equals 10, 
almost equal to the total score of person #754 with $X_+ = 12$. 

Insert Table 5 about here 

The pattern of person #754 was classified as fitting using V. Dividing the test into 
two subtests of 20 items each resulted in V = 0.150. Also dividing the test into four 
subtests of 10 items each and comparing the first two subtests resulted in a nonsignificant 
result: V = 0.500. 
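
As a check, which is an addition rather than part of the original analysis, the value for the split into two subtests of 20 items follows from the cumulative hypergeometric probability described in the Appendix. It uses only the scores reported above for person #754: 4 correct answers on the 20 easiest items and $X_+ = 12$.

$$
V = \sum_{w=0}^{4} \frac{\binom{20}{w}\binom{20}{12-w}}{\binom{40}{12}}
  = \frac{125970 + 3359200 + 35103640 + 191474400 + 610324650}{5586853480}
  = \frac{840387860}{5586853480} \approx 0.150 .
$$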

The difference between the $X^2$ statistics and M can be illustrated by the item score 
pattern of person #841. This person, with $X_+ = 19$, answers 19 out of the 20 easiest 
items correctly, and answers the 20 most difficult items incorrectly. This is unexpected on 
the basis of the probabilistic nature of the Rasch model. Thus I expect fewer correct answers 
on the easier subtest and more correct answers on the second subtest given $X_+ = 19$. For 
this pattern the $X^2_{dif}$ test has p = 0.0004. However, for the M statistic p = 0.4521 and 
V = 1.000. This illustrates that both M and V are insensitive and $X^2_{dif}$ is sensitive 
to violations of unidimensionality. To interpret this response behavior consider also 
the value of $X^2_{ord}$, because this type of answering behavior has many similarities with 
"plodding" behavior. For person #841, using $X^2_{ord}$, p = 0.0019. Thus on the basis of the 
item ordering, this pattern is very unlikely. This pattern is a good example of someone who 
is working very slowly but precisely, and uses a suboptimal strategy to obtain a maximal 
score given his/her ability. If he/she had guessed the answers on the more difficult items 
later in the test, the test score would have been higher. 

Note that the item score pattern of person #669 is also very unlikely. This person 
has 14 correct answers on the first 20 items and 14 correct answers on the 20 most 
difficult items. Thus, given $X_+ = 27$, there are too many correct answers on the difficult items and 
too few correct answers on the easier items. This pattern is more difficult to interpret than 
the other examples when only relying on the item difficulty order. Using $X^2_{ord}$, p = 0.0141; 
thus on the basis of the item presentation order this pattern is less unlikely. Item preview 
of the more difficult items may be an explanation, or it may be someone who is working 
very fast and easily skips items. 

Items ordered according to the rank order of presentation in the test. 

Different types of score patterns may be classified as misfitting depending on whether I 
use the presentation order on the test or the item difficulty order. I will only discuss the 
difference between $X^2_{dif}$ and $X^2_{ord}$ because M is not influenced by the ordering of the 
items, and V is only based on the item difficulty order. 

Because the presentation order and the difficulty order of the items in the TNVA 
partly overlap, some item score patterns are classified as misfitting by both 
$X^2$ statistics (see below). However, there are exceptions. In Table 6, it can be seen that 
persons #5, #827, and #101 produce item score patterns with many 
correct item scores on the first subtest and many incorrect scores on the second subtest. 
The ratio of the test scores on the two subtests for these persons was 16/0, 20/6, and 19/2. 
This indicates people who work slowly but conscientiously. Note that, for these people, 
using $X^2_{ord}$, p < .01, whereas using $X^2_{dif}$, p > 0.05. When I combine the information obtained 
from both statistics I conclude that these people answer both difficult items and easy items 
correctly (if not, they would have had very low p-values using $X^2_{dif}$) and mainly answer 
the first items in the test correctly. This is strong evidence for "plodding" behavior. 

Overlap of the statistics 

In Table 7 the overlap of the number of item score patterns classified as misfitting at 
$\alpha = 0.05$ by the different person-fit statistics is given. This table illustrates that the three 
statistics are sensitive to different types of misfit. Note, however, that of the 69 patterns 
classified as misfitting by M only 22 are also classified as misfitting by $X^2_{ord}$ and only 19 
are classified as misfitting using $X^2_{dif}$. For $X^2$, using the item difficulty order or the presentation 
order has an effect: the overlap of the number of item score patterns classified as 
misfitting is only 30, whereas 88 patterns are classified as misfitting using one of the two 
orderings and 93 patterns using the other. 

Insert Table 7 about here 

Discussion 

In this study, the item score patterns on an intelligence test were analyzed using both 
parametric and nonparametric person-fit statistics. A researcher uses an IRT model 
because it fits the data or because it has some favorable statistical characteristics. 
Nonparametric IRT models have the advantage that they more easily fit an empirical 
dataset than parametric IRT models. Which model should be used to analyze item score 
patterns? From the results obtained in this study, I tentatively conclude that a parametric 
IRT model may be preferred to a nonparametric IRT model. 

Using different kinds of person-fit statistics with the Rasch model resulted in a higher 
percentage of item score patterns classified as misfitting compared to Mokken's IRT 
models and, perhaps more interesting, it resulted in different kinds of misfit. In this 
study, V was too conservative in classifying an item score pattern as misfitting. For 
$\alpha = 0.05$ no item score patterns were classified as misfitting. Because V only gives an 
upper bound to the significance probability, I conclude that for a set of items for which 
the IRFs do not intersect a parametric approach may be preferred to a nonparametric 
approach. As Junker (2001) noted, "the Rasch model is a very well-behaved exponential 
family model with immediately understandable sufficient statistics (...). Much of the 
work on monotone homogeneity models and their cousins is directed at understanding 
just how generally these understandable, but no longer formally sufficient, statistics yield 
sensible inferences about examinees and items". If I interpret "sensible inferences" here 
as "inferences about misfitting response behavior", it seems more difficult to draw these 
inferences using nonparametric IRT than using parametric IRT. This finding suggests 
that future research should be aimed at improving the power of nonparametric person-fit 
statistics, and at specifying under which conditions these methods can be used as alternatives 
to parametric person-fit statistics. 

The researcher must also decide which logistic IRT model to use. Person-fit statistics 
detect small numbers of outliers that do not fit an IRT model. When the Rasch model 
shows a reasonable fit to the data, this model can be used to detect person misfit because more 
powerful statistical tests are available for this model than for other IRT models (Meijer & 
Sijtsma, 2001). Moreover, I could distinguish different types of misfit using different 
kinds of statistics. Although I used data from an intelligence test, the person-fit statistics 
illustrated in this paper could also be applied to data from personality testing to detect 
aberrant response patterns related to different test-taking behaviors. 
Appendix 



For each pair of subtest scores a conservative bound based on the hypergeometric 
distribution can be calculated for the probability that a person has at most $X_{+e} = x_{+e}$ 1s 
on the easiest subtest. Let $J$ denote the number of items in the two subtests, let $J_d$ denote 
the number of items in the most difficult subtest, and let $J_e$ denote the number of items 
in the easiest subtest. The cumulative hypergeometric probability has to be calculated 
bearing in mind that if $X_+ > J_d$ the minimum possible value of $X_{+e}$ is $X_+ - J_d$. Thus 

$$
P = P(X_{+e} \le x_{+e} \mid J, J_e, X_+) = \sum_{w=\max(0,\, X_+ - J_d)}^{x_{+e}} P(X_{+e} = w \mid J, J_e, X_+).
$$

To illustrate this, assume that I have the score pattern (1110100000) and that the items are 
ordered according to increasing item difficulty; thus the first 5 items in the item ordering 
are the easiest items and the second 5 items are the most difficult items. Suppose that I 
would like to compare the subtest score on the first 5 easiest items ($x_{+e} = 4$) with the 
subtest score on the second 5 most difficult items. Then 

$$
P = P(X_{+e} \le 4 \mid J = 10, J_e = 5, X_+ = 4) = \sum_{w=0}^{4} P(X_{+e} = w \mid J = 10, J_e = 5, X_+ = 4),
$$

which equals 

$$
P = \frac{\binom{5}{0}\binom{5}{4}}{\binom{10}{4}} + \frac{\binom{5}{1}\binom{5}{3}}{\binom{10}{4}} + \frac{\binom{5}{2}\binom{5}{2}}{\binom{10}{4}} + \frac{\binom{5}{3}\binom{5}{1}}{\binom{10}{4}} + \frac{\binom{5}{4}\binom{5}{0}}{\binom{10}{4}} = 0.0238 + 0.2381 + 0.4762 + 0.2381 + 0.0238 = 1.
$$

However, if I consider the item score pattern (0000010111), then 
the cumulative hypergeometric probability equals 

$$
P = \frac{\binom{5}{0}\binom{5}{4}}{\binom{10}{4}} = 0.0238.
$$



Figure Captions 



Figure 1. Examples of IRFs for different IRT models. 

Table 1 

M-values for different item score patterns 

Table 2 

Sensitivity of person-fit statistics to different types of deviant behavior; 
a "+" denotes sensitive and a "-" denotes insensitive to a particular type of 
deviant behavior 

Table 3 

Detection rates for different types of deviant 
response behavior 

Table 4 

Detection rates for different person-fit statistics 

Table 5 

Significance probabilities for the four most deviant item score patterns according to 
$X^2_{dif}$ with corresponding M values (items ordered according to increasing item 
difficulty; a superscript denotes the number of consecutive 0s or 1s, thus $1^2 0^3$ denotes 
11000) 

Table 6 

Significance probabilities for item score patterns classified as misfitting using $X^2_{ord}$ 
and fitting using $X^2_{dif}$ (item scores ordered according to the order in the test; a 
superscript denotes the number of consecutive 0s or 1s, thus $1^2 0^3$ denotes 11000) 

Table 7 

Overlap of the number of item score patterns classified as misfitting at $\alpha = 0.05$ 